Abstract

Recent developments in extracting and processing biological and clinical data are 
allowing quantitative approaches to studying living systems. High-throughput sequenc- 
ing, expression profiles, proteomics, and electronic health records are some examples of 
such technologies. Extracting meaningful information from those technologies requires 
careful analysis of the large volumes of data they produce. In this note, we present 
a set of distributions that commonly appear in the analysis of such data. These distributions
present some interesting features: they are discontinuous at the rational
numbers, but continuous at the irrational numbers, and possess a certain self-similar
(fractal-like) structure. The first set of examples which we present here are drawn from
a high-throughput sequencing experiment. Here, the self-similar distributions appear
as part of the evaluation of the error rate of the sequencing technology and the identification
of tumorigenic genomic alterations. The other examples are obtained from
risk factor evaluation and analysis of relative disease prevalence and co-morbidity as
these appear in electronic clinical data. The distributions are also relevant to identification
of subclonal populations in tumors and the study of the evolution of infectious
diseases, and more precisely the study of quasi-species and intrahost diversity of viral
populations. 

1 Introduction 

Figure 1: Left: Thomae's function, a self-similar function over the rational numbers in
the unit interval. Center/right: Distributions of ratios from heterozygous single nucleotide
polymorphisms data from the sequencing of a cancer genome.

The large volumes of data obtained by recent technological developments, such as high-throughput
sequencing and expression profiles, are providing novel and complementary
ways of studying biological systems. In order to extract meaningful, statistically significant
information from such data, mathematical methods are being developed, implemented,
and tested in various contexts. For example, it is believed that most tumors are due to
somatic mutations that lead to an uncontrolled cell growth. High-throughput sequencing 
technologies produce hundreds of gigabases of genetic data, providing a way to identify 
genes responsible for the tumorigenic process by comparing the genome of the tumor and 
the normal tissue |Bmgell09j IMardislOl IDinglOj ISalklOl IVTiTOl IPaslOj . 

In this note, we point out some interesting properties of the ratios of natural numbers 
obtained in a biological/clinical setting. The ratios of interest can be seen as sampled from 
a distribution over the rational numbers in the unit interval. Consider pairs of positive 
integers, n and m, sampled from a distribution with probability f(n,m). The ratio q = 
n/(n + m) of one of these numbers by the sum of the two is a rational number in the unit 
interval. In this way the distribution f(n,m) gives rise to a distribution g(q) supported 
on the rational numbers in the unit interval. A case of particular interest is when the two 
integers are drawn independently from the same distribution h(n). As we are going to 
see, in this case and for h being certain common distributions, such as exponential and 
power-law, it is possible to have a closed-form expression for g. We will also see that the 
resulting distributions over the rational numbers possess certain self-similarity properties. 
Namely, the overall shape of those distributions is similar to Thomae's function (Figure
[1], left). Although irrelevant to our discussion we would like to point out that, similar to
Thomae's function, the distributions which we study are rather interesting analytically, 
because, viewed as functions over the reals, they are continuous on the irrational numbers 
but not on the rationals. 
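
The construction described above can be made concrete with a short numerical sketch. It assumes, purely for illustration, that h is a geometric distribution on the positive integers with parameter p = 0.2, and it truncates the sums at n_max; the function names are hypothetical and not part of the original analysis.

```python
from collections import defaultdict
from fractions import Fraction


def geometric_pmf(n, p=0.2):
    """Illustrative choice of h: P(n) = p * (1 - p)**(n - 1) for n = 1, 2, ..."""
    return p * (1 - p) ** (n - 1)


def ratio_distribution(h, n_max):
    """Accumulate g(q) for q = n / (n + m), with n and m drawn independently from h.

    The double sum is truncated at n_max, so g is only approximately normalized.
    """
    g = defaultdict(float)
    for n in range(1, n_max + 1):
        for m in range(1, n_max + 1):
            # Fraction reduces n / (n + m) to lowest terms automatically.
            g[Fraction(n, n + m)] += h(n) * h(m)
    return dict(g)


g = ratio_distribution(geometric_pmf, n_max=200)
for q in (Fraction(1, 2), Fraction(1, 3), Fraction(2, 5), Fraction(49, 100)):
    print(q, g[q])
```

Under these assumptions the mass at 49/100 is many orders of magnitude smaller than the mass at the neighbouring point 1/2, which is the kind of spike structure discussed in the text.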

We will illustrate the appearance of such distributions in real-life data with two examples:
1) a high-throughput experiment aimed at identifying genomic variations in cancers
and 2) diagnosis information data collected at the New York Presbyterian Hospital in several
consecutive years. Although the presence of irregular shapes and spikes in empirically
occurring distributions of ratios of natural numbers was reported before as a statistical
artifact [John95], the authors of this previous work failed to acknowledge the interesting
mathematical structure of the underlying distributions. In this work we propose the study
of those naturally occurring distributions of rational numbers as an interesting mathematical
topic with important clinical and biological applications.

2 First Example: Identifying Genomic Alterations with 
High-throughput Sequencing 

Our first example comes from a high-throughput sequencing experiment of a diffuse large 
B-cell lymphoma (DLBCL) sample. DLBCL is the most common B-cell non-Hodgkin 
lymphoma in adults, accounting for ~40% of all new lymphoma diagnoses. Tumor DNA 
was extracted from a nodal tumor of a 63-year-old female patient. The coding part of the
genome (the exome) was enriched using Roche NimbleGen Sequence Capture and the
enriched product was sequenced using Roche 454 sequencing. The data produced from
the experiment were $2 \cdot 10^6$ reads (sequences of DNA) of average length 250 nucleotides.
The reads were aligned to the hg18/NCBI36.1 reference human genome. This resulted in a
coverage of about 10x of the human exome and the alignment was used to identify genomic
variants distinguishing normal and tumor cells. 




Figure 2: The human genome is diploid with two strands per chromosome. The reads 
covering a position of the genome can originate from each of the four strands. For every 
position, the ratio between the number of reads from one of the strands to the total 
number of reads from the chromosome and the ratio between the number of reads from the 
chromosome to the total number of reads covering the position are rational numbers. The 
distribution of each of these ratios follows a self-similar distribution. 

Figure [3] (left, blue) shows the depth (=number of reads covering a particular position) 
distribution (coverage) after alignment of the reads. The figure also shows in red a neg- 
ative binomial least-square fit of the data. If the reads were obtained from the genome 
independently and at random, one would expect the coverage to follow a Poisson distri- 
bution. As it is, even though restricted to a small part of the genome the coverage might 
be Poisson, overall, because of the way the sample was processed before sequencing, the 
means of the Poisson processes in different parts of the genome will vary. The result will 
be an overdispersion of the depth distribution and a better fit by the negative binomial,
known to be a mixture of Poisson distributions with Gamma-distributed means.
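
The overdispersion argument can be checked numerically. The sketch below assumes the parameterization NegBin(r, s) with probability mass function Gamma(k + r) / (k! Gamma(r)) * (1 - s)^r * s^k, which is one common convention and is an assumption here; the parameter values are illustrative only and are not fitted to the data of the experiment.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass function with mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def negbin_pmf(k, r, s):
    """NegBin(r, s) pmf under the assumed convention (mean r*s/(1-s))."""
    log_p = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
             + r * math.log(1 - s) + k * math.log(s))
    return math.exp(log_p)


def mean_and_variance(pmf, k_max=2000):
    """Mean and variance of a pmf on {0, 1, ...}, truncated at k_max."""
    mean = sum(k * pmf(k) for k in range(k_max + 1))
    second = sum(k * k * pmf(k) for k in range(k_max + 1))
    return mean, second - mean ** 2


r, s = 2.5, 0.8  # hypothetical parameters
nb_mean, nb_var = mean_and_variance(lambda k: negbin_pmf(k, r, s))
po_mean, po_var = mean_and_variance(lambda k: poisson_pmf(k, nb_mean))
print("negative binomial: mean", nb_mean, "variance", nb_var)
print("Poisson, same mean: mean", po_mean, "variance", po_var)
```

With these values the two distributions share the same mean, while the negative binomial variance exceeds it by the factor 1/(1 - s).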




Figure 3: Left: coverage in the cancer sequencing experiment. Center: coverage of the two 
copies of the cancer genome. Right: coverage of the two strands of a fixed copy of the 
cancer genome. 

Each of the 46 chromosomes of the human genome has two strands and, with the excep- 
tion of the sex chromosomes X and Y, the human genome is diploid, i.e. each chromosome 
has a homologous copy. Since the reference genome is given as entirely haploid, the in- 
formation about which copy of the genome a sample read originates from is not recovered 
by the alignment. Nonetheless, assuming that a read can originate from each copy of the 
genome with equal probability and given the coverage of the reference, one can obtain a 
theoretical coverage of a fixed copy of the genome. Thus the fraction of positions on a fixed 
copy of the genome covered with k reads should be 

\[ p(k) = \sum_{t \ge k} q(t) \binom{t}{k} 2^{-t}, \]

where $q(t)$ is the fraction of positions with coverage $t$, as given in Figure [3] (left, blue). After
a simple algebraic simplification it can be shown that, if $q$ is Poiss($\lambda$), then $p$ is Poiss($\lambda/2$).
Furthermore, since the negative binomial is a mixture of Poissons with Gamma-distributed
means, we can obtain that if $q$ is NegBin($r, s$), then $p$ is NegBin($r, (s/2)/(1 - s/2)$). Figure
[3] (left, green) shows the theoretical coverage of a fixed copy of the human genome obtained 
from these considerations. Similar reasoning leads us to a predicted coverage of a fixed 
strand of the human genome shown in Figure [3] (left, black). 
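
The halving of the coverage can be verified with a few lines of code. The sketch assumes that each read is assigned to a fixed copy of the genome with probability 1/2 (binomial thinning) and that NegBin(r, s) has probability mass function proportional to (1 - s)^r s^k; both the convention and the parameter values are assumptions made for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass function with mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def negbin_pmf(k, r, s):
    """NegBin(r, s) pmf: Gamma(k+r) / (k! Gamma(r)) * (1-s)**r * s**k (assumed convention)."""
    log_p = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
             + r * math.log(1 - s) + k * math.log(s))
    return math.exp(log_p)


def thin(q, k, t_max=300):
    """Coverage of one copy: sum over t >= k of q(t) * C(t, k) * 2**(-t), truncated at t_max."""
    return sum(q(t) * math.comb(t, k) * 0.5 ** t for t in range(k, t_max + 1))


lam, r, s = 10.0, 2.5, 0.8          # hypothetical parameters
s_half = (s / 2) / (1 - s / 2)      # predicted parameter after thinning
for k in range(4):
    print(k,
          thin(lambda t: poisson_pmf(t, lam), k), poisson_pmf(k, lam / 2),
          thin(lambda t: negbin_pmf(t, r, s), k), negbin_pmf(k, r, s_half))
```

Each printed row compares the thinned coverage with the predicted Poisson and negative binomial values, which should agree up to truncation error.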

Although the alignment to the reference does not provide exact information about the 
origin of a read in the sample, we can still test the prediction about the coverage of a 
fixed copy of the cancer genome in the following way: take sufficiently many heterozygous 
positions, i.e. positions at which the two copies of the genome differ, and then consider the 
number of reads covering such a position and containing one of the variants at that position 
and the number of reads containing the other variant. Those two depth distributions should 
be close to the predicted distribution of the coverage of a fixed copy of the genome. Figure 
[3] (center, blue and red) shows the result of these considerations. Here we took only
the positions of exonic single nucleotide polymorphisms documented in the NCBI's dbSNP
database, which are covered sufficiently well in the experiment (total of ~3000 heterozygous 
positions). Figure [3] (center, green) contains the predicted coverage of the two copies of 
the human genome as obtained earlier. Furthermore, Figure [3] (right) shows similar plots 
for a fixed strand of the genome. Since the information about the strand from which a 
sample read originates is also lost in the sequencing, here we used the orientation of a read 
when aligned to the reference as a surrogate for its strand. As can be seen, the predictions 
closely follow the data, confirming our intuition that the reads should come from the four 
strands of the genome independently. 

Our main observation is concerned with the heterozygous positions we used to obtain 
the data for Figure [3] (center/right). This time we consider the distribution of the ratios 
of the number of reads covering one of the variants at a particular position in the cancer 
genome to the total number of reads covering this position and the ratio of the number of 
reads covering one of the strands to the total number of reads covering the variant. The 
resulting distributions of ratios are given in Figure [1] (right and center, blue). There are
two apparent features of the distributions which drew us to studying them: first, their 
fractal-like self-similar structure, and second, the spikes they contain. We consider the 
topic of the self-similarity of the distributions in Appendix A and quantify it by computing 
the fractal dimension of related functions. From a biological point of view the spikes are 
interesting because at first sight one might decide that they show overrepresentation of 
certain ratios. For example, for the distribution of variant depth over the total depth, 
the spike at 0.5 is expected, since we are looking at heterozygous positions, but the spikes 
at 0.33 and 0.66 are harder to explain biologically since they would mean the significant 
presence of variants with ploidy other than 2. While such phenomena can occur in cancers
because they can present genome aberrations known as copy number alterations, the scale 
at which the phenomenon is represented here is unusual. We will see that the spikes are 
due to the discreteness of the data and could actually be explained by a simple stochastic 
model. Hence regarding the biological conclusions one can draw from high-throughput 
sequencing experiments, the message of our note is that when dealing with biological data 
the stochastic effects due to the discreteness of the data can be large, and care should be
taken when drawing conclusions lest one confuse such effects with real biological phenomena.
A similar conclusion was drawn in [John95]. In this note we further study the mathematical
properties of the resulting distributions. 

To formalize the situation we first define the convolution over the rational numbers of 
two functions defined over the natural numbers. Let 

\[ \mathbb{Q}_u = \mathbb{Q} \cap [0, 1] = \{ a/(a + b) : a \in \mathbb{N},\ b \in \mathbb{N},\ a + b > 0,\ (a, b) = 1 \} \]

be the set of rational numbers in the unit interval. For any two functions $f, g : \mathbb{N} \to \mathbb{R}$
define their convolution $c_{f,g} : \mathbb{Q}_u \to \mathbb{R}$ to be

\[ c_{f,g}\left(\frac{a}{a+b}\right) = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} f(n)\, g(m)\, \delta\!\left(\frac{a}{a+b} - \frac{n}{n+m}\right) = \sum_{t=1}^{\infty} f(ta)\, g(tb). \]

Note that, if $f$ and $g$ are distributions over $\mathbb{N}$, then $c_{f,g}$ is a distribution over $\mathbb{Q}_u$.
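
The two forms of the convolution, the double sum over all pairs (n, m) with a given ratio and the single sum over multiples t, can be compared directly. The sketch below uses a geometric distribution on the positive integers as a stand-in for f and g; this choice, the truncation limits, and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from math import gcd, isclose


def geometric_pmf(n, p=0.2):
    """Illustrative f = g: geometric on n = 1, 2, ... (zero mass at n = 0)."""
    return p * (1 - p) ** (n - 1) if n >= 1 else 0.0


def convolution(f, g, a, b, t_max=500):
    """Single-sum form: sum over t >= 1 of f(t*a) * g(t*b), truncated at t_max."""
    if gcd(a, b) != 1:
        raise ValueError("a and b must be coprime")
    return sum(f(t * a) * g(t * b) for t in range(1, t_max + 1))


def double_sum(f, g, q, n_max=300):
    """Double-sum form: add f(n) * g(m) over all (n, m) != (0, 0) with n / (n + m) == q."""
    total = 0.0
    for n in range(n_max + 1):
        for m in range(n_max + 1):
            # Cross-multiplied test for n / (n + m) == q, avoiding division.
            if n + m > 0 and n * q.denominator == q.numerator * (n + m):
                total += f(n) * g(m)
    return total


single = convolution(geometric_pmf, geometric_pmf, 1, 2)
double = double_sum(geometric_pmf, geometric_pmf, Fraction(1, 3))
print(single, double, isclose(single, double, rel_tol=1e-9))
```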

In Figure [1] (center, red) we have also plotted the convolution $c_{p,p}$ of the negative-binomially
distributed predicted coverage $p$ of the two copies of the cancer genome as given
in Figure [3] (center, green). In Figure [1] (right, red) we have done the same for the coverage
of a fixed strand. As can be seen, the convolutions closely follow the empirical distributions
of ratios. This observation is consistent with the null hypothesis of reads originating from
the four strands of the human genome independently and covering a particular position 
on the genome with a negative-binomial distribution. No further assumption seems to be 
necessary to explain the irregular shapes of the ratio distributions. 

We would like to finish the exposition in this section by noting that the observed 
structures are not particular to the Roche 454 sequencing technology and can be observed in 
sequencing experiments performed with other sequencing platforms, e.g. Illumina's Solexa 
and Life Technologies' SOLiD. 



3 Second Example: Electronic Clinical Data 

The development and implementation of electronic clinical records has made available large 
amounts of longitudinal clinical data. The primary application of electronic clinical data is 
to improve the quality of health care provided to the individual patients. Although using 
this data for uncovering large scale correlations and trends comes secondary to this, the 
impact such data mining will have on the public health is indisputable. Some specific areas 
which will be influenced by such analyses are the creation of alert systems for emerging 
infectious diseases, identification of populations at risk, and measuring the efficacy and 
efficiency of public health measures. A recent example of this is provided by the 2009 H1N1 
influenza pandemic. The first wave of the new influenza strain infected a considerable part 
of the world population at the end of spring and the beginning of the summer of 2009
[Fras09, Anzic09]. Evaluating the impact of the new pandemic strain on the public health 
involved analyzing large clinical datasets [Jami09, Cowl10, Khia10].

The New York Presbyterian Hospital has an electronic repository with the longitudinal 
clinical records of more than 2 million patients. An example of the large scale analysis 
enabled by this data is the identification of populations that are at higher risk of morbid- 
ity/mortality from the new pandemic influenza virus versus seasonal influenza, for instance, 
people with asthma, children, pregnant women, etc. [Khia10]. The approach we took for
this analysis was to compare the number of people with a given condition who were af- 
fected by seasonal or pandemic influenza at different time points. Towards this goal, for 
every two diseases identified by their ICD9 codes, we can obtain from the electronic health 
records the number of people who have been affected by both diseases. Although this
might differ from the established terminology, for the purpose of this note we will call this
number the co-morbidity of the two diseases. In this way for a fixed disease we can obtain 
its co-morbidity with all other possible diseases. If we do this for two diseases, which in 
our analysis we take to be seasonal and pandemic influenza, we can then compare the sets 
of co-morbidities and look for conditions enriched with respect to one of the diseases but 
not the other. Figure [4] (left) shows the distribution of co-morbidities with seasonal and
pandemic influenza. As can be seen these distributions are long-tailed and can be modeled 
with power-law distributions. The results of the power-law fits are also shown in Figure [4] 
(left). 

For a particular health condition, an important measure of the risk of being infected 
by seasonal versus pandemic influenza for people who have had this condition is the ratio 
of the number of people who have had both that condition and seasonal influenza, i.e. the
co-morbidity with seasonal flu, to the total number of people who have had the condition,
i.e. the sum of the co-morbidities with seasonal and pandemic flu. We have plotted the 
distribution of these ratios in Figure [4] (center, blue). As can be seen its shape has the 
self-similar structure of interest to us. From the discussion so far one might be tempted to 
model this distribution as the convolution of the power-law distributions modeling the two 
sets of co-morbidities. The result of this attempt is shown in Figure [4] (center, green). The
graph shows that in this case the convolution is not a good model because the empirical
ratios are shifted to the left, whereas the convolution is not. In Figure [4] (right) we have
plotted the pairs of co-morbidities for all conditions. The Spearman correlation coefficient
for the two sets is 0.83 and linear regression shows that the co-morbidities for pandemic 
influenza are 1.3 times the corresponding co-morbidities for the seasonal influenza. Hence 
one might suppose that the discrepancy is due to the fact that the pairs of co-morbidities 
are not independent - the convolution defined above assumes that the two distributions 
are independent. 




Figure 4: Comparing the co-morbidity of various conditions with the 2009 H1N1 pandemic 
versus seasonal influenza. 

To avoid this obstacle we reconsidered our model for the distribution of co-morbidities 
and asked the following question: what is the source of the long-tail of this distribution? 
Our stipulation is that 1) for a fixed pair of diseases the co-morbidity is Poisson distributed
when observed at different time points; 2) the means of these Poissons vary from pair to
pair of diseases; and 3) the distribution of these means is long-tailed. These stipulations
are supported by the data in the electronic health records. Furthermore, it is not hard to
show that a mixture of Poissons whose means follow a power law with exponent $\alpha$
is a distribution whose tail follows a power law with the same exponent $\alpha$ (see
Appendix C). We use this observation to model the long-tail distribution of the two sets of 
co-morbidities. In Figure [4] (left, black) we have plotted the result of a mixture of Poissons 
with power-law distributed means. 

Next we claim that the observed distribution of ratios is a mixture of convolutions 
of pairs of Poissons where the mixing is with the same power-law distribution used for 
the distribution of co-morbidities. More precisely, let's say that the co-morbidity of a 
fixed condition with seasonal influenza is Poisson with mean $\lambda_s$ and its co-morbidity with
the pandemic strain is Poisson with mean $\lambda_p$. From our observation on the dependence
between the two sets of co-morbidities, we can say that $\lambda_p = \gamma \lambda_s$ for some $\gamma$. Hence the
risk ratio of this condition with the two kinds of influenza will be distributed according to
the convolution of the two Poissons, which we denote with $R_{\lambda_s}$. Since the mean of $R_{\lambda_s}$ is
$\lambda_s / (\lambda_s + \lambda_p) = 1/(1 + \gamma)$ (see Appendix B), for $\gamma \ne 1$ this mean will be shifted away from
1/2 depending on $\gamma$. Our model of the distribution for pairs of co-morbidities is a power-law
mixture of distributions choosing the two co-morbidities independently according to two
Poissons, i.e.



\[ f(n, m) = \int g_\alpha(\lambda)\, P_\lambda(n)\, P_{\gamma\lambda}(m)\, d\lambda, \]

where $g_\alpha(\lambda) \propto \lambda^{-\alpha}$. Note that although $f(n, m)$ is not a product distribution, i.e. its
marginals are not independent, it is a mixture of such distributions. Finally, the distribu- 
tion of risk ratios is given by 

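\[ \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} f(n, m)\, \delta\!\left(\frac{a}{a+b} - \frac{n}{n+m}\right) = \int g_\alpha(\lambda)\, R_\lambda\!\left(\frac{a}{a+b}\right) d\lambda. \]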

Figure [4] (center, green) shows the result of these considerations. We observe a good fit 
between the empirical distribution to the right of 1/2 and the new model and the predicted 
overall shift of the model to the left. The apparent discrepancy between the empirical and 
the theoretical model shows that a further investigation of the model is necessary. Since 
the goal of this note is to give examples of and draw attention to the interesting self-similar 
distributions appearing in empirical data, rather than to explore one particular example in 
detail, we leave the further analysis of the distribution of co-morbidities and the risk ratios 
derived from them to a future work. 
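
A rough numerical sketch of the mixture model described in this section is given below. It replaces the integral over the Poisson mean by a midpoint rule on a finite interval and uses an exponent of 2 for the power law; the integration range, the exponent, the truncation n_max, and all function names are assumptions made for illustration, and only the factor 1.3 is taken from the regression reported above.

```python
import math
from collections import defaultdict
from fractions import Fraction


def poisson_pmf(k, lam):
    """Poisson probability mass function with mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def mixture_ratio_distribution(alpha, gamma, lam_min=1.0, lam_max=20.0,
                               steps=40, n_max=90):
    """Ratio distribution n / (n + m) for a power-law mixture of Poisson pairs.

    n ~ Poisson(lam), m ~ Poisson(gamma * lam), lam weighted by lam**(-alpha)
    on [lam_min, lam_max] (midpoint rule, weights renormalized to sum to 1).
    """
    width = (lam_max - lam_min) / steps
    lams = [lam_min + (i + 0.5) * width for i in range(steps)]
    weights = [lam ** (-alpha) for lam in lams]
    total = sum(weights)
    joint = [[0.0] * (n_max + 1) for _ in range(n_max + 1)]
    for lam, w in zip(lams, weights):
        p_n = [poisson_pmf(n, lam) for n in range(n_max + 1)]
        p_m = [poisson_pmf(m, gamma * lam) for m in range(n_max + 1)]
        for n in range(n_max + 1):
            scale = (w / total) * p_n[n]
            row = joint[n]
            for m in range(n_max + 1):
                row[m] += scale * p_m[m]
    ratios = defaultdict(float)
    for n in range(n_max + 1):
        for m in range(n_max + 1):
            if n + m > 0:  # the ratio is undefined when both counts are zero
                ratios[Fraction(n, n + m)] += joint[n][m]
    return dict(ratios)


def conditional_mean(ratios):
    """Mean ratio, normalized over the pairs with n + m > 0."""
    mass = sum(ratios.values())
    return sum(float(q) * p for q, p in ratios.items()) / mass


ratios = mixture_ratio_distribution(alpha=2.0, gamma=1.3)
print(conditional_mean(ratios), 1 / (1 + 1.3))
```

In this sketch the mean ratio sits at 1/(1 + gamma), to the left of 1/2, for every mixing weight, so the shift does not depend on the assumed exponent or integration range.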

4 Closed Form for the Convolution



As a step towards understanding the mathematical properties of functions over the rational 
numbers in the unit interval obtained as the convolution of functions over the natural 
numbers, we attempted to obtain a closed form, i.e. in terms of known functions, for some 
of them. Ideally, given the considerations above, it would be interesting to obtain a closed 
form for the convolution of two negative binomials or two Poissons. Unfortunately we were 
not able to obtain a closed form in those cases. Since a negative binomial is a sum of 
geometric distributions, let us consider the ratio of two geometrically distributed random 
variables. More generally, we will consider a power-law with exponential cut-off of which 
the exponential distribution (the continuous analogue of the geometric distribution) is a 
special case. Let g be the probability mass function of a variable distributed according to 
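a power-law with exponential cut-off with parameters $\alpha, \beta \ge 0$ such that $\beta > 0$ or $\alpha > 1$,
i.e.

\[ g(k) = \frac{k^{-\alpha} e^{-\beta k}}{\mathrm{Li}_\alpha(e^{-\beta})}, \qquad k = 1, 2, \ldots, \]

where

\[ \mathrm{Li}_\alpha(x) = \sum_{k=1}^{\infty} k^{-\alpha} x^k \]

is the polylogarithm function. In particular

\[ \mathrm{Li}_\alpha(1) = \zeta(\alpha) \quad \text{and} \quad \mathrm{Li}_0(x^{-1}) = \frac{1}{x - 1}. \]

Then

\[ c_{g,g}\left(\frac{a}{a+b}\right) = \sum_{t=1}^{\infty} g(ta)\, g(tb) = \frac{\mathrm{Li}_{2\alpha}\!\left(e^{-\beta(a+b)}\right)}{\mathrm{Li}_\alpha\!\left(e^{-\beta}\right)^2}\, (ab)^{-\alpha}. \]

Power-law Take $\beta = 0$ and $\alpha > 1$. Then

\[ c_{g,g}\left(\frac{a}{a+b}\right) = \frac{\zeta(2\alpha)}{\zeta(\alpha)^2}\, (ab)^{-\alpha}. \]

Exponential Take $\alpha = 0$, $\beta > 0$. Then

\[ c_{g,g}\left(\frac{a}{a+b}\right) = \frac{(e^{\beta} - 1)^2}{e^{\beta(a+b)} - 1}. \]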
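
The power-law and exponential closed forms can be compared with a direct, truncated evaluation of the sum over t. The sketch below assumes the mass function g(k) proportional to k^(-alpha) e^(-beta k) on k = 1, 2, ..., uses the known values zeta(2) = pi^2/6 and zeta(4) = pi^4/90 for the power-law case with alpha = 2, and picks beta = 0.3 arbitrarily; the function names are hypothetical.

```python
import math


def cutoff_power_law_pmf(k, alpha, beta, norm):
    """g(k) = k**(-alpha) * exp(-beta * k) / norm, with norm = Li_alpha(exp(-beta))."""
    return k ** (-alpha) * math.exp(-beta * k) / norm


def direct_convolution(pmf, a, b, t_max=5000):
    """Truncated sum over t >= 1 of pmf(t*a) * pmf(t*b)."""
    return sum(pmf(t * a) * pmf(t * b) for t in range(1, t_max + 1))


def exponential_closed_form(a, b, beta):
    """(e^beta - 1)^2 / (e^(beta*(a+b)) - 1)."""
    return math.expm1(beta) ** 2 / math.expm1(beta * (a + b))


def power_law_closed_form_alpha2(a, b):
    """zeta(4) / zeta(2)^2 * (a*b)^(-2); the prefactor equals 2/5."""
    return 0.4 * (a * b) ** (-2)


a, b, beta = 2, 3, 0.3
norm_exp = 1.0 / math.expm1(beta)   # Li_0(e^-beta) = 1 / (e^beta - 1)
norm_pl = math.pi ** 2 / 6          # Li_2(1) = zeta(2)

exp_direct = direct_convolution(
    lambda k: cutoff_power_law_pmf(k, 0, beta, norm_exp), a, b)
pl_direct = direct_convolution(
    lambda k: cutoff_power_law_pmf(k, 2, 0.0, norm_pl), a, b)
print(exp_direct, exponential_closed_form(a, b, beta))
print(pl_direct, power_law_closed_form_alpha2(a, b))
```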

Uniform Although this example does not present a distribution appearing naturally in
the discussion above, we believe it is fundamental enough to mention here. Furthermore, as
discussed in Appendix A, this example is related to Thomae's function, because a certain
infinite analogue of it has the same fractal dimension.
For a natural number $L$ let $f_L$ be the probability mass function which is uniform on
the set $\{1, 2, \ldots, L\}$, i.e.

\[ f_L(k) = \begin{cases} 1/L, & k \in \{1, 2, \ldots, L\} \\ 0, & \text{otherwise.} \end{cases} \]

Then

\[ c_{f_L, f_L}\left(\frac{a}{a+b}\right) = \sum_{t=1}^{\infty} f_L(ta)\, f_L(tb) = \frac{1}{L^2} \min\left\{ \left\lfloor \frac{L}{a} \right\rfloor, \left\lfloor \frac{L}{b} \right\rfloor \right\} = \frac{1}{L^2} \left\lfloor \frac{L}{\max(a, b)} \right\rfloor. \]
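
For the uniform case the closed form, floor(L / max(a, b)) / L^2, can be checked exactly with rational arithmetic. The sketch below is a brute-force comparison for a small, arbitrarily chosen L; the function names are hypothetical.

```python
from fractions import Fraction
from math import gcd


def uniform_pmf(k, L):
    """f_L(k) = 1/L for k in {1, ..., L} and 0 otherwise."""
    return Fraction(1, L) if 1 <= k <= L else Fraction(0)


def direct_convolution(a, b, L):
    """Sum over t >= 1 of f_L(t*a) * f_L(t*b); terms vanish once t > L."""
    return sum(uniform_pmf(t * a, L) * uniform_pmf(t * b, L)
               for t in range(1, L + 1))


def closed_form(a, b, L):
    """floor(L / max(a, b)) / L**2."""
    return Fraction(L // max(a, b), L * L)


L = 12
pairs = [(a, b) for a in range(1, L + 1) for b in range(1, L + 1)
         if gcd(a, b) == 1]
print(all(direct_convolution(a, b, L) == closed_form(a, b, L) for a, b in pairs))
print(sum(closed_form(a, b, L) for a, b in pairs))  # total mass over coprime pairs
```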



Thomae's function 



\[ f_T\left(\frac{a}{a+b}\right) = \frac{1}{a+b}. \]

This function, supported on the rational numbers in the unit interval, is not a dis- 
tribution. It is a classic example of a function which is constant almost everywhere and 
yet discontinuous on a dense set. It can be beautifully interpreted as the view from the 
corner of Euclid's orchard - an imaginary orchard which contains a tree at every point 
with integer coordinates. Although it probably is not the convolution of functions over the 
natural numbers, the fact that versions of it appeared in our empirical data was a pleasant 
surprise to us and one of the main motivations for this study. In Appendix A we will show 
that the graph of this function has a fractal dimension 3/2. 
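
As a small illustration, Thomae's function as defined above can be evaluated with exact rational arithmetic: the value at a rational point is the reciprocal of its denominator in lowest terms. The function name and the sample grid of twelfths are choices made for this sketch.

```python
from fractions import Fraction


def thomae(q):
    """f_T(a/(a+b)) = 1/(a+b): reciprocal of the denominator of q in lowest terms."""
    q = Fraction(q)
    if not 0 <= q <= 1:
        raise ValueError("q must lie in the unit interval")
    return Fraction(1, q.denominator)


# Values on the grid k/12: the height depends only on the reduced denominator.
values = {Fraction(k, 12): thomae(Fraction(k, 12)) for k in range(13)}
for q, height in sorted(values.items()):
    print(q, height)
```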



5 Conclusions 

We have presented a set of self-similar distributions supported on the rational numbers 
in the unit interval. These functions appear pervasively in the analysis of large datasets 
when models for the distribution of ratios of natural numbers are required. The examples 
presented in this manuscript are drawn from next-generation sequencing data obtained as 
part of a study on the identification of somatic mutations, on one hand, and understanding 
disease co-morbidity as it is reflected in electronic clinical data, on the other. One can 
envisage further applications in clinical and biological settings in which the estimation of a 
frequency or ratio is necessary. Such examples are provided by the detection of subclonal 
populations in tumor samples, e.g. as part of a study on resistance to chemotherapy; the 
study of quasi-species and intrahost viral populations, e.g. in HIV and influenza; and 
studies of drug effectiveness, populations at risk in a pandemic, and other topics in clinical 
research approachable through the analysis of risk ratios. We hope that our presentation 
will stimulate further study of the functions presented here and provide a bridge between 
interesting theoretical work and important clinical applications.

A Fractal Dimensions



The distributions we have found in the examples above present a self-similar fractal structure.
Distributions are normalizable by definition, i.e. the lengths of all segments in Figure
1 should sum to 1. We are interested in calculating the fractal dimensions of more
general non-normalizable functions. More precisely, given a function $f : \mathbb{Q}_u \to \mathbb{R}$, define
$G(f)$ to be the set of line segments in the plane from $(q, 0)$ to $(q, f(q))$ for $q \in \mathbb{Q}_u$. We are
interested in $\dim G(f)$, the fractal dimension of the set $G(f)$.

If $f$ forms a probability distribution on $\mathbb{Q}_u$, then $\sum_{q \in \mathbb{Q}_u} f(q) = 1 < \infty$ and so
$\dim G(f) = 1$.

For a given $\alpha > 1$, let $f_\alpha : \mathbb{Q}_u \to \mathbb{R}$ be

\[ f_\alpha\left(\frac{a}{a+b}\right) = (ab)^{-\alpha}. \]

From the discussion in Section 4 it follows that for $\alpha > 1$

\[ \sum_{q \in \mathbb{Q}_u} f_\alpha(q) = \frac{\zeta^2(\alpha)}{\zeta(2\alpha)}. \]

Hence, in this case, $\dim G(f_\alpha) = 1$. It will be interesting to obtain $\dim G(f_1)$. The following
calculations from [BCFM98] should help in obtaining this dimension.
Let $f_T : \mathbb{Q}_u \to \mathbb{R}$ be Thomae's function

\[ f_T\left(\frac{a}{a+b}\right) = (a + b)^{-1}. \]

We will show that $\dim G(f_T) = 3/2$. Since $\max\{a, b\} = \Theta(a + b)$, one can think of Thomae's
function as the infinite analogue of the convolution of the uniform distribution on $\{1, \ldots, L\}$
extended to $L = \infty$.
Let $F_n$ be the $n$-th Farey sequence, i.e. $F_n = \{x_0 = 0 < x_1 < \cdots < x_{m_n} = 1\}$ is the
sequence of all rational numbers $x_i = a_i/(a_i + b_i) = a_i/c_i \in \mathbb{Q}_u$, such that $a_i \le c_i \le n$,
sorted in increasing order. Let $A_n^{(i)}$ be the area of the trapezoid between the x-axis and
the line segment with points $(x_{i-1}, f_T(x_{i-1}))$ and $(x_i, f_T(x_i))$. Then

\[ 2 A_n^{(i)} = \left(f_T(x_{i-1}) + f_T(x_i)\right)(x_i - x_{i-1}) = \frac{c_{i-1} + c_i}{c_{i-1}^2 c_i^2}, \]

where we use that $x_i - x_{i-1} = 1/(c_{i-1} c_i)$.

Let $A_n = \sum_{i=1}^{m_n} A_n^{(i)}$ be the area under the piece-wise linear curve with points from $F_n$.
We will calculate $A_{n-1} - A_n$ for $n \ge 3$. Consider two consecutive members $a_{i-1}/c_{i-1}$ and
$a_i/c_i$ of $F_{n-1}$, which have an element $y_i = (a_{i-1} + a_i)/(c_{i-1} + c_i)$ of $F_n$ inserted between
them. Then $c_{i-1} + c_i = n$ and

\[ 2\left(A_{n-1}^{(i)} - A_n^{(i')} - A_n^{(i'+1)}\right) = \frac{c_{i-1} + c_i}{c_{i-1}^2 c_i^2} - \frac{c_{i-1} + n}{c_{i-1}^2 n^2} - \frac{n + c_i}{n^2 c_i^2} = \frac{1}{c_{i-1} c_i n}, \]

where $i'$ is the index of $y_i$ in $F_n$.

For every $n > a > 0$, if $d = (a, n)$, there exist unique $0 < n' < n$ and $0 \le a' < a$ such that
$d = (a', n')$, $n'a - a'n = d^2$, $a' < n'$, and $a'' = a - a' < n - n' = n''$. If $a/n \in \mathbb{Q}_u \setminus \{0, 1\}$,
then $(a, n) = 1$ and we have that $a'/n', a''/n'' \in F_{n-1}$ are consecutive and $a/n \in F_n$ is
inserted between them. Hence

\[ A_{n-1} - A_n = \sum_{\substack{a=1 \\ (a,n)=1}}^{n} \frac{1}{2 n' n'' n} = \frac{1}{2n} \sum_{\substack{c=1 \\ (c,n)=1}}^{n} \frac{1}{c(n - c)} = \frac{1}{n^2} \sum_{\substack{c=1 \\ (c,n)=1}}^{n} \frac{1}{c} = \frac{G_n}{n^2}, \]

where we let

\[ G_n := \sum_{\substack{c=1 \\ (c,n)=1}}^{n} \frac{1}{c}. \]
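
The relation between consecutive Farey areas, namely that the area under Thomae's function drops by G_n / n^2 when passing from the (n-1)-th to the n-th Farey sequence, can be checked with exact rational arithmetic. The sketch below builds the sequences by brute force; the function names are hypothetical and the range of n is arbitrary.

```python
from fractions import Fraction
from math import gcd


def farey(n):
    """n-th Farey sequence: all reduced fractions in [0, 1] with denominator <= n."""
    return sorted({Fraction(a, c) for c in range(1, n + 1) for a in range(c + 1)})


def thomae(x):
    """Thomae's function: reciprocal of the reduced denominator."""
    return Fraction(1, x.denominator)


def area(n):
    """Area under the piecewise linear curve through (x, thomae(x)) for x in F_n."""
    xs = farey(n)
    return sum((thomae(x0) + thomae(x1)) * (x1 - x0) / 2
               for x0, x1 in zip(xs, xs[1:]))


def G(n):
    """Sum of 1/c over 1 <= c <= n with gcd(c, n) == 1."""
    return sum(Fraction(1, c) for c in range(1, n + 1) if gcd(c, n) == 1)


for n in range(2, 8):
    print(n, area(n - 1) - area(n), G(n) / n ** 2)
```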

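Since $A_1 = 1$ and $\lim_{n \to \infty} A_n = 0$ we obtain that

\[ A_k = 1 - \sum_{n=2}^{k} \frac{G_n}{n^2} = \sum_{n=k+1}^{\infty} \frac{G_n}{n^2}. \]

Since

\[ \sum_{b \mid n} b\, G_b = \sum_{b \mid n} \sum_{\substack{c=1 \\ (c,b)=1}}^{b} \frac{b}{c} = \sum_{d \mid n} \sum_{\substack{c=1 \\ (c,n)=d}}^{n} \frac{n}{c} = \sum_{c=1}^{n} \frac{n}{c} = n H_n, \]

where $H_n$ is the $n$-th harmonic number, from Möbius inversion it follows that

\[ n G_n = \sum_{b \mid n} \mu(n/b)\, b\, H_b. \]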
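
Both the divisor-sum identity, sum over b | n of b G_b = n H_n, and its Möbius inversion can be confirmed exactly for small n. The sketch below uses a simple trial-division Möbius function; all names are hypothetical.

```python
from fractions import Fraction
from math import gcd


def mobius(n):
    """Möbius function by trial division."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # a squared prime factor
            result = -result
        p += 1
    if n > 1:
        result = -result  # one remaining prime factor
    return result


def H(n):
    """n-th harmonic number."""
    return sum(Fraction(1, c) for c in range(1, n + 1))


def G(n):
    """Sum of 1/c over 1 <= c <= n with gcd(c, n) == 1."""
    return sum(Fraction(1, c) for c in range(1, n + 1) if gcd(c, n) == 1)


def divisors(n):
    return [b for b in range(1, n + 1) if n % b == 0]


for n in (1, 6, 12, 30):
    lhs = sum(b * G(b) for b in divisors(n))
    inverted = sum(mobius(n // b) * b * H(b) for b in divisors(n))
    print(n, lhs == n * H(n), inverted == n * G(n))
```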
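We are ready to obtain an asymptotic expression for $A_k$. Namely

\[ A_k = \sum_{n=k+1}^{\infty} \frac{1}{n^2} \sum_{b \mid n} \mu(n/b)\, \frac{b}{n}\, H_b = \sum_{c=1}^{\infty} \sum_{\substack{n=k+1 \\ c \mid n}}^{\infty} \frac{\mu(c)\, H_{n/c}}{c\, n^2} = \sum_{c=1}^{\infty} \frac{\mu(c)}{c^3} \sum_{d=\lceil (k+1)/c \rceil}^{\infty} \frac{H_d}{d^2} \]

\[ \approx \sum_{c=1}^{\infty} \frac{\mu(c)}{c^3} \sum_{d=\lceil (k+1)/c \rceil}^{\infty} \frac{\ln d}{d^2} \approx \sum_{c=1}^{\infty} \frac{\mu(c)}{c^3} \int_{k/c}^{\infty} \frac{\ln x}{x^2}\, dx = \Theta\!\left(\frac{\ln k}{k}\right). \]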

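Let

\[ \varepsilon_k = \min_i \{x_i - x_{i-1}\} = \frac{1}{k(k - 1)}, \]

where the minimum is over the elements of $F_k$. We need

\[ N_k = \Theta(A_k / \varepsilon_k^2) = \Theta(k^3 \ln k) \]

squares of size $\varepsilon_k$ to cover the set $G(f_T)$. Hence

\[ \dim G(f_T) = \lim_{k \to \infty} \frac{\ln N_k}{-\ln \varepsilon_k} = 3/2. \]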
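
The limit can be spelled out in one more step. Assuming, as stated above, a covering number of order $k^3 \ln k$ and a square size of $1/(k(k-1))$, the ratio of logarithms expands as follows.

```latex
\[
\frac{\ln N_k}{-\ln \varepsilon_k}
  = \frac{3 \ln k + \ln \ln k + O(1)}{2 \ln k + \ln(1 - 1/k)}
  \;\longrightarrow\; \frac{3}{2}
  \qquad (k \to \infty).
\]
```

The $\ln \ln k$ term in the numerator is why convergence to 3/2 is slow.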

Let $\{y_0 = 0 < y_1 < \cdots < y_{m_k} = 1\}$ be the sequence of rational numbers
$x = a/(a + b) \in \mathbb{Q}_u$, such that $a, b \le k$, sorted in increasing order. Using similar arguments
as above we can show that the length $L_{\alpha,k}$ of the curve with points $(y_i, f_\alpha(y_i))$ satisfies

i (*'<'-> -fc'-) log* 

^=£w c(2)(1 _ a) • 

(o,6)=l 

Let $A_{\alpha,k}$ be the area under the curve with points $(y_i, f_\alpha(y_i))$. Furthermore, let $\delta_k = \min_i \{y_i - y_{i-1}\} = \Theta(k^{-2})$
and $N_{\alpha,k}$ be the number of squares of size $\delta_k$ necessary to cover $G(f_\alpha)$.
Since $N_{\alpha,k} = \Theta(A_{\alpha,k} / \delta_k^2) = \Omega(\delta_k L_{\alpha,k} / \delta_k^2)$ we obtain that for $\alpha \in [0, 1]$

\[ \dim G(f_\alpha) = \lim_{k \to \infty} \frac{\log N_{\alpha,k}}{-\log \delta_k} \ge 2 - \alpha. \]

We believe that this lower bound is an equality.

B Mean of the Convolution



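In this section we compute the mean of the convolution of distributions on the natural
numbers. Consider two distributions $f, g : \mathbb{N} \to \mathbb{R}$ and their convolution $c_{f,g} : \mathbb{Q}_u \to \mathbb{R}$

\[ c_{f,g}\left(\frac{a}{a+b}\right) = \sum_{t=1}^{\infty} f(ta)\, g(tb). \]

Define $\varphi_{f,g}$:

\[ \varphi_{f,g}(t) = \sum_{n=1}^{\infty} \sum_{m=0}^{\infty} f(n)\, g(m)\, \frac{n}{m + n}\, e^{t(m + n)}. \]

Notice that the mean of the convolution is exactly $\varphi_{f,g}(0)$.
We have that

\[ \varphi'_{f,g}(t) = \left( \sum_{n=0}^{\infty} n f(n) e^{tn} \right) \left( \sum_{m=0}^{\infty} g(m) e^{tm} \right) = \chi'_f(t)\, \chi_g(t), \]

where $\chi_f$ and $\chi_g$ are the moment generating functions of $f$ and $g$ correspondingly.
If $f = g$, then solving the differential equation $\varphi'_{f,f} = \chi'_f \chi_f$ we obtain that

\[ \varphi_{f,f} = (1/2)\, \chi_f^2 \]

and so the mean of the convolution of a distribution on the natural numbers with itself is

\[ \varphi_{f,f}(0) = 1/2. \]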

It would be interesting to obtain a similar result not assuming the equality of $f$ and $g$. For
now we present a proof that if $f$ is Poisson with mean $\lambda$ and $g$ is Poisson with mean $\mu$,
then the mean of their convolution is $\lambda/(\lambda + \mu)$. The proof uses the fact that the moment
generating function of a Poisson with mean $\lambda$ is

\[ \chi_f(t) = e^{\lambda(e^t - 1)}. \]

Combining this fact with the observations above we obtain

\[ \varphi'_{f,g}(t) = \lambda e^t e^{(\lambda + \mu)(e^t - 1)}. \]

Solving this differential equation gives that

\[ \varphi_{f,g}(t) = \frac{\lambda}{\lambda + \mu}\, e^{(\lambda + \mu)(e^t - 1)}. \]

Therefore the mean of the convolution of two Poissons is

\[ \varphi_{f,g}(0) = \frac{\lambda}{\lambda + \mu}, \]

as claimed.
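
The value of the mean can be checked numerically. The sketch below sums n / (n + m) over a truncated grid of Poisson counts and, as an assumption of this sketch, normalizes over the pairs with n + m > 0, since the ratio is undefined when both counts are zero; the means 3 and 5 are arbitrary.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass function with mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def ratio_mean(lam, mu, n_max=120):
    """Mean of n / (n + m) for n ~ Poisson(lam), m ~ Poisson(mu), given n + m > 0."""
    p_n = [poisson_pmf(n, lam) for n in range(n_max + 1)]
    p_m = [poisson_pmf(m, mu) for m in range(n_max + 1)]
    weighted = 0.0
    mass = 0.0
    for n in range(n_max + 1):
        for m in range(n_max + 1):
            if n + m > 0:
                w = p_n[n] * p_m[m]
                weighted += w * n / (n + m)
                mass += w
    return weighted / mass


print(ratio_mean(3.0, 5.0), 3.0 / (3.0 + 5.0))
print(ratio_mean(4.0, 4.0), 0.5)
```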

C Mixing Poissons

For $\alpha > 1$ let $M_\alpha$ be a mixture of Poissons with power-law distributed means with exponent $\alpha$,
i.e.

\[ M_\alpha(k) = \frac{\alpha - 1}{k!} \int_1^{\infty} x^{k - \alpha} e^{-x}\, dx. \]

For $k \gg \alpha - 1$ we have that

\[ M_\alpha(k) = \frac{(\alpha - 1)\, \Gamma(k - \alpha + 1, 1)}{\Gamma(k + 1)} \approx (\alpha - 1)\, k^{-\alpha}. \]
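
The power-law tail of the mixture can be checked by numerical integration. The sketch below assumes the mixing density (alpha - 1) x^(-alpha) on x >= 1 and approximates the integral with a midpoint rule on a finite range around the peak of the integrand; the step count, the cut-off, and the function names are choices made for this sketch.

```python
import math


def mixture_pmf(k, alpha, steps=20000):
    """Midpoint-rule value of (alpha - 1) * integral_1^inf x**(k - alpha) * exp(-x) / k! dx."""
    x_max = k + 12.0 * math.sqrt(k + 1.0) + 40.0  # integrand is negligible beyond this
    h = (x_max - 1.0) / steps
    log_k_factorial = math.lgamma(k + 1)
    total = 0.0
    for i in range(steps):
        x = 1.0 + (i + 0.5) * h
        total += math.exp((k - alpha) * math.log(x) - x - log_k_factorial)
    return (alpha - 1.0) * total * h


def scaled(k, alpha):
    """M_alpha(k) * k**alpha / (alpha - 1); tends to 1 if the tail is a power law."""
    return mixture_pmf(k, alpha) * k ** alpha / (alpha - 1.0)


for k in (100, 200, 400):
    print(k, scaled(k, 2.5))
```

The printed ratios should approach 1 as k grows, consistent with a tail of order k^(-alpha).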